Industrialization and manufacturing growth in China have been accompanied by major growth in pollution. Key pollutants include fine particulate matter (PM2.5 and PM10), nitrogen dioxide (NO2), sulfur dioxide (SO2), surface ozone (O3), and carbon monoxide (CO). The US EPA’s Air Quality Index (AQI) is one index of air quality based on these six pollutants. To attain a “good” AQI based on PM2.5, the 24-hour average has to be less than 12 μg/m^3; a 24-hour average PM2.5 level of at least 55.5 μg/m^3 is considered “unhealthy,” while levels above 250.5 μg/m^3 are considered “hazardous.” The data here come from Beijing, notable for its poor air quality.
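A minimal R sketch of that categorization, assuming only the three thresholds quoted above (the EPA’s intermediate categories between “good” and “unhealthy” are collapsed into a single “in between” label here):

```r
# Classify a 24-hour average PM2.5 reading (ug/m^3) using only the
# thresholds quoted above; intermediate EPA categories are collapsed.
pm25_category <- function(pm25) {
  cut(pm25,
      breaks = c(-Inf, 12, 55.5, 250.5, Inf),
      labels = c("good", "in between", "unhealthy", "hazardous"),
      right = FALSE)
}
as.character(pm25_category(c(8, 60, 300)))  # "good" "unhealthy" "hazardous"
```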
The file beijing.csv contains 420,768 hourly measurements of six pollutants and weather-related factors measured at 12 monitoring stations around Beijing from March 1, 2013 through February 28, 2017. This file has not been cleaned. The data are as follows, with all pollutants measured in micrograms per cubic meter (μg/m^3).
a. We have data at an hourly scale, but there are a couple of derived attributes related to date that we may find useful. Use ISOdatetime() to build a new column called as_datetime, setting the minute and second to 0. The yday function in the lubridate package gives the day of the year, and format( , "%A") may be useful for the weekday name.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
beijing <- read.csv("beijing.csv")
beijing$as_datetime = ISOdatetime(beijing$year, beijing$month, beijing$day, beijing$hour, 0, 0)
beijing$weekday <- format(beijing$as_datetime, "%A")
beijing$day_of_year <- yday(beijing$as_datetime)
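As a quick sanity check on these derived columns, here is one timestamp worked by hand (the date is the first day in the data; the values are for illustration):

```r
library(lubridate)

# One hourly timestamp: 2013-03-01 05:00, the first day in the data.
d <- ISOdatetime(2013, 3, 1, 5, 0, 0)
format(d, "%A")  # "Friday" (in an English locale)
yday(d)          # 60 (31 days of January + 28 of February + 1)
```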
b. Each measurement is associated with a weather station. The location of each station is in stations.csv. Inner join the information from that file into our air quality data.
stations <- read.csv("stations.csv")
beijing <- inner_join(beijing, stations, by=c("station"))
c. This data is currently in a wide format with respect to the pollutants (one column for each pollutant), but we would like to compare trends across pollutants. Ideally we would have a figure that summarizes ALL our data while splitting up by pollutant. This calls for a long format. Use the pivot_longer function from the tidyverse to make a new data frame with a row for each pollutant, a column titled pollutant for the name of the pollutant, and a column titled concentration for its concentration.
beijing_long <- pivot_longer(beijing,
c("PM2.5", "PM10", "SO2", "NO2", "CO", "O3"),
names_to="pollutant",
values_to="concentration")
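To see what pivot_longer is doing, here is the same reshape on a tiny hand-made data frame (the values are made up):

```r
library(tidyr)

# Two hours of fake measurements in wide format.
wide <- data.frame(hour = c(1, 2), PM2.5 = c(10, 20), SO2 = c(3, 4))
long <- pivot_longer(wide, c("PM2.5", "SO2"),
                     names_to = "pollutant",
                     values_to = "concentration")
long  # 4 rows: one per hour/pollutant pair
```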
a. Build a single figure that describes the distribution of the pollutants depending on the day of the week. To make your figure more informative, consider the order that the weekdays are displayed. For this first figure, do not use facet_wrap or facet_grid and instead set the x aesthetic to pollutant.
beijing_long$weekday <- factor(beijing_long$weekday,
levels=c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
geom_boxplot()
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
b. Wow, that figure looks pretty rough, since the pollutant concentrations vary not only across pollutants but also within each pollutant due to the huge number of outliers! Make the same figure but use a log scale. What base of the logarithm should you use?
ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
geom_boxplot() +
scale_y_continuous(trans='log10')
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
c. Now make the same graph but use facet_wrap or facet_grid. Is this figure, or the one above, more informative (your opinion)?
ggplot(beijing_long, aes(y = concentration, fill=weekday)) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
facet_wrap(~pollutant)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
d. The distributions are much easier to visualize now without compromising the conclusions we can draw. Does the day of the week seem to make a difference?
e. Make a similar figure with a log scale, but showing the change in pollutants by month. It is your choice if you would like to facet or not. Describe some patterns you see in the data.
ggplot(beijing_long, aes(pollutant, concentration, fill=factor(month))) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
scale_fill_discrete(labels=month.abb)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
f. Finally make a similar figure, but on an hourly scale. How could this plot be improved?
ggplot(beijing_long, aes(pollutant, concentration, fill=factor(hour))) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
ylab("Concentration")
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
a. Beijing residents believe that the worst air quality occurs during easterly winds. Make a data frame of summary statistics describing the relative magnitude of each pollutant by wind direction. If you encounter NAs you may ignore them. Your data frame should allow someone to quickly answer the following questions:
Hint: You may find the following code chunk helpful.
# https://stackoverflow.com/questions/42597344/convert-from-compass-direcitions-to-degrees-r
wind_direction_map <- setNames(
seq(0, 337.5 , by=22.5),
c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
"S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")
)
print(wind_direction_map["N"])
## N
## 0
wind_direction_info <- beijing_long %>%
drop_na(wd, pollutant) %>%
group_by(wd, pollutant) %>%
summarise(
mean_speed = mean(WSPM, na.rm = TRUE),
mean_value = mean(na.omit(concentration)),
angle = wind_direction_map[first(wd)]
)
## `summarise()` has grouped output by 'wd'. You can override using the `.groups` argument.
head(wind_direction_info)
## # A tibble: 6 × 5
## # Groups: wd [1]
## wd pollutant mean_speed mean_value angle
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 E CO 1.31 1579. 90
## 2 E NO2 1.31 59.7 90
## 3 E O3 1.31 47.7 90
## 4 E PM10 1.31 124. 90
## 5 E PM2.5 1.31 102. 90
## 6 E SO2 1.31 18.4 90
max_mean_by_pollutant <- wind_direction_info %>%
group_by(pollutant) %>%
summarise(max_mean = max(mean_value))
wind_direction_info <-
inner_join(wind_direction_info, max_mean_by_pollutant, by = c("pollutant"))
head(wind_direction_info)
## # A tibble: 6 × 6
## # Groups: wd [1]
## wd pollutant mean_speed mean_value angle max_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 E CO 1.31 1579. 90 1621.
## 2 E NO2 1.31 59.7 90 64.1
## 3 E O3 1.31 47.7 90 95.3
## 4 E PM10 1.31 124. 90 126.
## 5 E PM2.5 1.31 102. 90 102.
## 6 E SO2 1.31 18.4 90 18.7
b. Build a single figure using polar coordinates which describes the average relative magnitude for each pollutant when grouped by wind direction. Account for wind speed in some way in your figure.
ggplot(wind_direction_info,
aes(
x = angle,
y = mean_value / max_mean,
fill = mean_speed
)) +
geom_bar(width = 22.5,
stat = "identity",
color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
coord_polar(start = -pi / 16) +
theme(
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
axis.line = element_blank(),
legend.position = "bottom",
legend.key.width = unit(1.5, "cm")
) +
geom_text(aes(angle, 1.1, label = wd), color = "black", size = 2) +
ggtitle("Average of Pollutant vs. Direction") +
guides(fill = guide_colorbar(title = "Wind Speed (m/s)")) +
facet_wrap( ~ pollutant)
c. Does your visualization support the belief of Beijing residents?
Beijing experienced one of the worst air quality crises in Chinese history in January 2013, with the entire city covered in a dense gray haze visible from space. This experience led the Chinese government to immediately draft an air pollution control plan that was implemented effective September 2013. We will build a visualization similar to the one below, where the line graphs for each year are displayed on top of each other, but with the year 2013 highlighted:
a. To make our graph a little easier to build, let’s aggregate our data by day. Make another data frame that summarizes each pollutant/day pair by the mean of each pollutant across all of Beijing. Include the day-of-year attribute from earlier in your summary data frame. Name your data frame beijing_by_day.
Hint: Your new dataframe should have at least the following column names:
"year", "month", "day", "pollutant", "average_concentration", "day_of_year"
beijing_by_day <- beijing_long %>%
group_by(year, month, day, pollutant) %>%
summarise(average_concentration = mean(na.omit(concentration)),
day_of_year = first(day_of_year))
## `summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.
b. Because the data is so erratic, it helps to visualize the trends by using some type of rolling average. Use rollmean, rollmax, rollmedian, or rollsum from the zoo package to do some type of smoothing and make a new data frame with the smoothed data. Fill in the starter code below with your choice of rolling average and an appropriate smoothing window.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
beijing_smoothed <- beijing_by_day %>%
arrange(day_of_year) %>%
group_by(pollutant, year) %>%
summarise(
# below line is one that needs to be filled in
rolling_average = rollmedian(average_concentration, 7,
na.pad = TRUE),
day_of_year = day_of_year
)
## `summarise()` has grouped output by 'pollutant', 'year'. You can override using the `.groups` argument.
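To illustrate the windowing, here is rollmedian on a toy vector with a window of 3; na.pad = TRUE keeps the output the same length as the input, as needed above:

```r
library(zoo)

x <- c(5, 1, 9, 3, 7)
rollmedian(x, 3)                 # 5 3 7 (median of each 3-value window)
rollmedian(x, 3, na.pad = TRUE)  # NA 5 3 7 NA
```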
c. Create the plot mentioned above with your smoothed data.
Hint: Use the “day of year” column defined earlier.
Bonus: Format the x-axis ticks to display an appropriate date format.
ggplot(
beijing_smoothed,
aes(
x = day_of_year,
y = rolling_average,
color = factor(year),
alpha = factor(year)
)
) +
geom_line() +
facet_wrap( ~ pollutant) +
xlab("Day of Year") +
ylab("Log of 7-day rolling median for pollutant daily mean") +
scale_y_continuous(trans = "log10") +
scale_alpha_manual(values = c(1, 0.15, 0.15, 0.15, 0.15)) +
scale_x_continuous(breaks = c(1, 91, 182, 274), labels=c("Jan", "Apr", "Jul", "Oct"))
## Warning: Removed 30 row(s) containing missing values (geom_path).
d. What visual evidence (if any) is there that the plan in 2013 was effective? Justify your choice of smoothing function and window.
a. We will now animate the data using ggmap and gganimate. Pick a map from the ./maps folder. Justify your map choice. You can see what a map looks like with code similar to the chunk below:
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
load(file = "./maps/terrain_lines_bw.RData") # sets map variable
ggmap(map)
b. For simplicity, choose a single pollutant you would like to visualize. Set the single_frame variable below to a filtered version of the beijing data frame with a fixed year, month, day, hour, and pollutant.
single_frame <- beijing_long %>%
filter(year==2015, month==7, day==2, hour==3, pollutant=="PM2.5")
c. Use ggmap to make a figure visualizing the pollutant and hour you chose at the stations. We will do this in steps:
i. Use geom_segment to draw arrows at the locations of the given stations with the given starter code. For the aes of geom_segment, set the xend and yend aesthetics to reflect the angle of the wind direction and the speed of the wind, e.g. a fast wind to the North-East would point to the top-right and be a bit longer. Set the color aesthetic to reflect concentration. Use an appropriate color scale.
map_step_1 <- ggmap(map, extent="device", legend="bottomright") +
geom_segment(
data=single_frame,
aes(x=long,
y=lat,
xend=long + sin(wind_direction_map[wd] * pi/180) * WSPM/50, # east component of bearing
yend=lat + cos(wind_direction_map[wd] * pi/180) * WSPM/50,  # north component of bearing
color=concentration),
arrow=arrow(length=unit(0.2, "cm"))) +
scale_color_viridis_c()
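For a compass bearing measured in degrees clockwise from north, the east component is the sine and the north component is the cosine of the angle. A small standalone sketch of that conversion (bearing_to_vector is a hypothetical helper, not part of the assignment):

```r
# Hypothetical helper: compass bearing in degrees (clockwise from north)
# to an (east, north) unit vector; scale by wind speed for an arrow.
bearing_to_vector <- function(deg) {
  rad <- deg * pi / 180
  c(east = sin(rad), north = cos(rad))
}
round(bearing_to_vector(0))   # north: east 0, north 1
round(bearing_to_vector(90))  # east:  east 1, north 0
```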
ii. Use geom_voronoi and stat_voronoi to draw Voronoi cells on the map, where seed points are station locations. The cells do not need to cover the entire map.
library(ggvoronoi)
map_step_2 <- map_step_1 +
geom_voronoi(
data=single_frame,
aes(x=long, y=lat, fill=TEMP),
alpha=0.1) +
stat_voronoi(
data=single_frame,
aes(x=long, y=lat),
geom = "path") +
scale_fill_gradient(low="blue", high="red")
d. Use gganimate to animate the data for a week. Reuse your code above for each frame. Make sure you set the title to reflect the current time being shown.
library(gganimate)
one_week <- beijing_long %>%
filter(year == 2015, month == 5, day %in% 1:7, pollutant == "PM2.5")
our_animation <- ggmap(map, extent="device", legend="bottomright") +
geom_segment(
data=one_week,
aes(x=long,
y=lat,
xend=long + sin(wind_direction_map[wd] * pi/180) * WSPM/50, # east component of bearing
yend=lat + cos(wind_direction_map[wd] * pi/180) * WSPM/50,  # north component of bearing
color=concentration),
arrow=arrow(length=unit(0.2, "cm"))) +
geom_voronoi(
data=one_week,
aes(x=long, y=lat, fill=TEMP),
alpha=0.1) +
stat_voronoi(
data=one_week,
aes(x=long, y=lat),
geom = "path") +
scale_fill_gradient(low="blue", high="red") +
scale_color_viridis_c() +
transition_time(as_datetime) +
labs(x="Longitude",
y="Latitude",
title = "{frame_time}",
fill="Temperature (C)",
color="Concentration (μg/m^3)")
animate(our_animation, duration = 5, fps = 6, width = 1000, height = 1000, renderer = gifski_renderer())
anim_save("output.gif")
e. Were the Voronoi cells helpful for this visualization? Why or why not?